YORO - Lightweight End to End Visual Grounding

Abstract

We present YORO, a multi-modal transformer encoder-only architecture for the Visual Grounding (VG) task. This task involves localizing, in an image, an object referred to via natural language. Unlike the recent trend in the literature of using multi-stage approaches that sacrifice speed for accuracy, YORO seeks a better trade-off between speed and accuracy by embracing a single-stage design, without a CNN backbone. YORO consumes natural language queries, image patches, and learnable detection tokens and predicts the coordinates of the referred object, using a single transformer encoder. To assist the alignment between text and visual objects, a novel patch-text alignment loss is proposed. Extensive experiments are conducted on 5 different datasets, with ablations on architecture design choices. YORO is shown to support real-time inference and to outperform all approaches in this class (single-stage methods) by large margins. It is also the fastest VG model and achieves the best speed/accuracy trade-off in the literature. Code is available at https://github.com/chihhuiho/yoro .
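To make the single-stage input format more concrete, the following is a minimal sketch of how text tokens, image patches, and detection tokens could be laid out in one sequence for a single encoder. It is not taken from the YORO code base: the names `SequenceLayout`, `build_layout`, and `cxcywh_to_xyxy`, the token ordering, the patch size, and the normalized (cx, cy, w, h) box format are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SequenceLayout:
    """Positions of each modality inside the single encoder input sequence."""
    text: slice      # language query tokens
    patches: slice   # flattened image patches
    det: slice       # learnable detection tokens
    length: int      # total sequence length


def build_layout(num_text_tokens, height, width, patch_size, num_det_tokens):
    """Hypothetical layout: [text | patches | detection tokens].

    The ordering is an assumption; the abstract only states that all three
    inputs are consumed by one transformer encoder.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    num_patches = (height // patch_size) * (width // patch_size)
    text_end = num_text_tokens
    patch_end = text_end + num_patches
    det_end = patch_end + num_det_tokens
    return SequenceLayout(
        text=slice(0, text_end),
        patches=slice(text_end, patch_end),
        det=slice(patch_end, det_end),
        length=det_end,
    )


def cxcywh_to_xyxy(box, width, height):
    """Convert an assumed normalized (cx, cy, w, h) box to pixel corners."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * width,
        (cy - h / 2) * height,
        (cx + w / 2) * width,
        (cy + h / 2) * height,
    )


# Example with assumed sizes: 20 text tokens, a 384x384 image, 32x32 patches,
# and 5 detection tokens.
layout = build_layout(20, 384, 384, 32, 5)
pixel_box = cxcywh_to_xyxy((0.5, 0.5, 0.5, 0.5), 384, 384)
```

In such a layout, the encoder outputs at the positions in `layout.det` would be the ones passed to a box-prediction head, while the outputs at `layout.text` and `layout.patches` are the natural candidates for a text-to-patch alignment objective.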




Journal

Journal title: Lecture Notes in Computer Science

Year: 2023

ISSN: 1611-3349, 0302-9743

DOI: https://doi.org/10.1007/978-3-031-25085-9_1